ABSTRACT

Gene families often differ to varying degrees in
their exon-intron structures, depending on their
distinct evolutionary histories. Comparative analysis
of gene structures is important for understanding 
their evolutionary and functional relationships 
within plant species. Here, we present a comparative
genomics database named PIECE
(http://wheat.pw.usda.gov/piece) for Plant Intron and Exon
Comparison and Evolution studies. The database
contains all the annotated genes extracted from 25 
sequenced plant genomes. These genes were clas- 
sified based on Pfam motifs. Phylogenetic trees 
were pre-constructed for each gene category. 
PIECE provides a user-friendly interface for different 
types of searches and a graphical viewer for 
displaying a gene structure pattern diagram linked 
to the resulting bootstrapped dendrogram 
for each gene family. The gene structure evolution 
of orthologous gene groups was determined using 
the GLOOME, Exalign and GECA software programs 
that can be accessed within the database. PIECE 
also provides a web server version of the software, 
GSDraw, for drawing schematic diagrams of gene 
structures. PIECE is a powerful tool for comparing 
gene sequences and provides valuable insights 
into the evolution of gene structure in plant 
genomes. 

INTRODUCTION 

In eukaryotes, a typical gene structure contains two
elements: exons and introns. Exons are the DNA
sequences that are transcribed and represented in the
mature forms of RNA (mRNAs) that serve as templates
for synthesis of the encoded proteins. Introns that
interrupt the exons in gene sequences are also transcribed,
but they are removed from the mature RNA transcript by
RNA splicing. Comparative analysis of exon-intron
organization is important for understanding rules of 
gene structure and organization, protein functionality 
and evolutionary changes among species. The structural 
information of genes and gene families can serve as 
material for phylogenetic analyses to understand the 
gain, loss and change of gene structures (1-3), thereby 
elucidating mechanisms underlying the molecular evolu- 
tion of genes and genomes (4-6). The increasing availabil- 
ity of plant genome sequences now makes it possible to 
conduct phylogenetic analyses of genes or gene families 
from a large number of plant species representing a 
large evolutionary distance. Typically, phylogenetic 
analyses of genes of interest require, first, the extraction 
of genes with corresponding intron and exon structure 
information, followed by phylogenetic analyses using 
available software programs. Comparing gene sequences 
to identify evolutionarily conserved gene structures is 
useful for predicting the biological function of protein- 
coding genes of interest. Accordingly, some plant com- 
parative genomic databases, such as PlantGDB (7), 
PLAZA (8) and Phytozome (9), are well known and 
widely used because these databases allow users to 
extract gene structure data including exon-intron pos- 
itions, exon and intron lengths and alternative splicing. 
Usually, users will still need to perform further analyses 
on the extracted data with available software programs to 
gain insight regarding the evolution and function of gene 
structure. Databases dealing with gene structure analyses 
are available, but with a primary emphasis on non-plant 
species. CIWOG is a plant database that displays common 
introns within orthologs in eight plant species (10). 
Furthermore, in most cases, phylogenetic trees, gene struc- 
tures, protein domains and exon-intron comparisons for 
orthologs have not yet been integrated together and there- 
fore the related databases do not provide a comprehensive 
view pertinent to evolution and function of gene structure. 
For instance, these databases do not contain information
regarding which Pfam domain in a gene family contains
conserved intron sites and phases. An intron falls into
one of three phases according to its position relative to
the codons of the flanking exons: between two codons
(phase 0), between the first and second nucleotides of a
codon (phase 1) or between the second and third
nucleotides of a codon (phase 2). Intron phases are a
conserved character of eukaryotic gene structures
because any phase change requires either compensatory 
double mutations or a more complex molecular mechan- 
ism. Therefore, the location of the introns within the same 
sites and phase of related genes is strong support for an 
evolutionary relationship. Meanwhile, it is important to 
understand that the evolution of gene structure is often 
associated with the evolutionary history and functional 
domains of a gene of interest. 
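
The phase definition above can be illustrated with a minimal sketch. It assumes that only the coding (CDS) lengths of the exons are known, listed in transcript order; the function name is hypothetical and is not part of PIECE.

```python
def intron_phases(cds_exon_lengths):
    """Return the phase (0, 1 or 2) of each intron.

    cds_exon_lengths: coding lengths of the exons in transcript order.
    The phase of an intron is the number of coding bases that precede
    it, modulo 3.
    """
    phases = []
    coding_bases = 0
    # No intron follows the last exon, so it is skipped.
    for length in cds_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


# Three coding exons of 90, 100 and 50 bases give two introns.
print(intron_phases([90, 100, 50]))
```

In this example the first intron follows 90 coding bases and therefore lies between two codons (phase 0), whereas the second follows 190 coding bases and lies between the first and second nucleotides of a codon (phase 1).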

Here, we report the development of PIECE
(http://wheat.pw.usda.gov/piece), a comprehensive plant gene
comparison and evolution database containing all the
annotated genes described from 25 plant species with
available sequenced genomes. The database includes data for
17 eudicots, 5 monocots, 2 green algae and the moss
Physcomitrella patens (Supplementary Table S1). The
annotated genes were extracted from each species and 
classified based on their Pfam motif (11). Phylogenetic 
trees were pre-constructed for each gene category by 
integrating exon-intron and protein motif information. 
The intron site data can be shown not only in the 
genomic sequence but also in protein alignment sequences. 
The sequence and gene structure information for each 
identified gene is available for online access within the 
PIECE website. The database contains orthologs in 
those species for comparative analysis and evolutionary 
studies of gene structure. Several gene structure analysis 
software tools including GLOOME (12), Exalign (13) and 
GECA (14) have been integrated into PIECE and can be 
executed for each orthologous group to display exon- 
intron gain, loss and conservation. PIECE also provides 
a web interface package, GSDraw (Gene Structure Draw 
Server), for drawing schematic diagrams of the structures 
of genes derived from other species in addition to the 25 
sequenced plant species. Users can submit genomic, coding
DNA sequence (CDS) and transcript sequences. GSDraw
uses this information to obtain the gene structure, protein 
motif and phylogenetic tree and outputs the results as 
diagrams. PIECE can provide valuable information for 
plant researchers for analyzing the evolution of gene struc- 
ture and for elucidating the biological function of 
proteins. PIECE is a useful resource for the research com- 
munity, particularly for the study of exon-intron 
evolution. 



DATABASE CONSTRUCTION 

Data collection 

PIECE currently contains a total of 947 630 annotated
genes from 25 sequenced plant species, ranging from
lower plants to angiosperms (Supplementary Table S1).
Genome sequences, transcript sequences, protein
sequences and annotation GFF files were downloaded
from Phytozome (9). Exon-intron site, length and intron
phase data were extracted from the genome annotation
GFF files using an in-house Java program.
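
The in-house Java program is not described further. As a rough illustration of this kind of extraction, the sketch below derives intron coordinates and lengths from CDS features. It assumes GFF3-style, tab-separated lines with 1-based inclusive coordinates and a `Parent` attribute naming the transcript; the function name is hypothetical.

```python
from collections import defaultdict


def introns_from_gff(gff_lines, feature="CDS"):
    """Map each transcript to a list of (start, end, length) introns."""
    segments = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent is not None:
            segments[parent].append((int(cols[3]), int(cols[4])))

    introns = {}
    for parent, segs in segments.items():
        segs.sort()  # genomic order; valid for either strand
        introns[parent] = [
            (prev_end + 1, next_start - 1, next_start - prev_end - 1)
            for (_, prev_end), (next_start, _) in zip(segs, segs[1:])
        ]
    return introns


example = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t101\t200\t.\t+\t0\tID=c1;Parent=t1",
    "chr1\tsrc\tCDS\t301\t450\t.\t+\t2\tID=c2;Parent=t1",
]
print(introns_from_gff(example))
```

Introns are reported in genomic order, so for genes on the minus strand the list would need to be reversed to obtain transcript order.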

Plant gene family classification 

Plant genes were grouped into different families based on 
their protein domains using the Pfam database (v26.0) 
(11). We applied the hmmsearch program in the
HMMER package (15) to search against the protein
sequences of each species to classify genes. An E-value
cutoff of 0.01, which has been widely adopted for HMMER
searches, was used for queries. Many genes have more
than one Pfam domain. For example, the B3 domain 
(PF02362) is present in either the ABI3-VP1 family or 
the RAV subfamily of the AP2 family. In this case, we 
therefore assigned PF02362 as a gene family entry that 
includes genes in the ABI3-VP1 and AP2 families. 
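
A minimal sketch of the grouping rule described above, assuming the HMMER hits have already been reduced to (gene, Pfam accession, E-value) tuples. The function name, the gene identifiers and the `PF_OTHER` label are hypothetical, and whether the cutoff is inclusive is also an assumption.

```python
from collections import defaultdict


def classify_by_pfam(hits, evalue_cutoff=0.01):
    """Group genes into families keyed by Pfam accession.

    hits: iterable of (gene_id, pfam_accession, evalue) tuples.
    A gene with several domains is placed in every matching family.
    """
    families = defaultdict(set)
    for gene, pfam, evalue in hits:
        if evalue <= evalue_cutoff:  # inclusive cutoff is an assumption
            families[pfam].add(gene)
    return {pfam: sorted(genes) for pfam, genes in families.items()}


hits = [
    ("geneA", "PF02362", 1e-30),   # B3 domain only
    ("geneB", "PF02362", 2e-12),   # B3 domain plus a second domain
    ("geneB", "PF_OTHER", 5e-20),
    ("geneC", "PF02362", 0.5),     # fails the E-value cutoff
]
print(classify_by_pfam(hits))
```

Because families are keyed by domain, a gene with two domains appears under both entries, which matches the treatment of PF02362 described above.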

Multiple sequence alignment and phylogenetic analysis 

Multiple sequence alignment (MSA) was performed using 
the MUSCLE v3.8.31 program (16). The default parameters
were used if the number of members in a gene family
was <500; otherwise, the option '-maxiters 2' was applied.
For phylogenetic analyses, the FastTree v2.1.4 program
was used (17), which implements a fast and accurate
approximate maximum likelihood method. FastTree
analyses were conducted with default parameters;
specifically, the amino acid substitution matrix used was JTT,
the number of rate categories of sites (CAT model) was 20
and the local support values of each node were computed by
re-sampling the site likelihoods 1000 times.
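
The parameter choice described above can be sketched as a small helper that only builds the command lines and does not run them. It assumes that the executables are installed as `muscle` (v3.8 syntax) and `FastTree`; the function name and file names are hypothetical.

```python
def alignment_and_tree_commands(family_fasta, aligned_fasta, n_members):
    """Build MUSCLE and FastTree argument lists for one gene family."""
    muscle = ["muscle", "-in", family_fasta, "-out", aligned_fasta]
    if n_members >= 500:
        # Large families: limit refinement iterations, as described above.
        muscle += ["-maxiters", "2"]
    # FastTree with no extra options uses its protein defaults and
    # writes the Newick tree to standard output.
    fasttree = ["FastTree", aligned_fasta]
    return muscle, fasttree


muscle_cmd, fasttree_cmd = alignment_and_tree_commands(
    "family.fa", "family.aln.fa", n_members=1200
)
print(" ".join(muscle_cmd))
print(" ".join(fasttree_cmd))
```

The lists could be passed to `subprocess.run`, with the standard output of FastTree redirected to a tree file.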

Putative ortholog annotation 

To predict putative orthologous relationships of genes 
among these plant species, we used the BLAST score 
ratio (BSR) method, which has been widely adopted by 
ENSEMBL and other studies (18). An all-against-all 
BLASTP search with a strict E-value cutoff of <1e-20 was
performed, and the BSR value was calculated for each hit. 
After comparing results at different BSR values, we chose 
a BSR value >0.4 as the cutoff and retrieved the top se- 
quences in the species with the largest BSR values as the 
putative ortholog(s). 
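
The selection step can be sketched as follows. The sketch assumes the usual definition of the BSR, i.e. the bit score of a query against a hit divided by the bit score of the query against itself, and assumes that the E-value filter has already been applied. The function name and the input layout are hypothetical.

```python
def putative_orthologs(self_scores, hits, bsr_cutoff=0.4):
    """Pick the best hit per (query, species) with BSR above the cutoff.

    self_scores: query_id -> bit score of the query against itself.
    hits: iterable of (query_id, subject_id, subject_species, bit_score).
    Returns {(query_id, species): (subject_id, bsr)}.
    """
    best = {}
    for query, subject, species, score in hits:
        bsr = score / self_scores[query]
        if bsr <= bsr_cutoff:
            continue
        key = (query, species)
        # Keep only the hit with the largest BSR in each species.
        if key not in best or bsr > best[key][1]:
            best[key] = (subject, bsr)
    return best


self_scores = {"q1": 200.0}
hits = [
    ("q1", "a1", "rice", 150.0),
    ("q1", "a2", "rice", 100.0),
    ("q1", "b1", "poplar", 60.0),
]
print(putative_orthologs(self_scores, hits))
```

This sketch keeps a single best hit per species; hits with tied BSR values would need extra handling if several putative orthologs are to be reported.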

Orthologous gene structure evolution 

PIECE uses GLOOME, Exalign and GECA for 
the orthologous gene structure evolution analysis. 
GLOOME can analyze the presence and absence profiles 
(phyletic patterns), which are widely used in biology (12). 
The default parameter settings were used for GLOOME 
analyses. Because the required input is a phyletic pattern 
provided as a 0/1 MSA, we first used MUSCLE to obtain 
the alignment of orthologous protein sequences with 
default parameters, and then calculated all intron sites. 
For each gene, if it had an intron at a given site in the
aligned consensus sequence, we marked '1' for the site;
if not, we marked '0'. To obtain orthologous gene
exon-intron gain and loss information, a Java pipeline
was implemented to include the steps described above,
i.e. gene family classification, ortholog annotation and
orthologous gene structure analysis.
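
The construction of the 0/1 phyletic pattern can be illustrated with a minimal sketch. It assumes that each intron has already been mapped to the 0-based index of a residue in the ungapped protein; this convention and the function name are assumptions made for illustration and do not describe the PIECE pipeline itself.

```python
def phyletic_pattern(aligned, intron_residues):
    """Build a 0/1 intron presence-absence pattern per gene.

    aligned: gene -> gapped protein sequence (all of equal length).
    intron_residues: gene -> set of 0-based residue indices (ungapped
        coordinates) at which an intron occurs.
    Returns (sites, patterns), where sites are the alignment columns
    that carry an intron in at least one gene.
    """
    intron_columns = {}
    for gene, seq in aligned.items():
        wanted = intron_residues.get(gene, set())
        cols = set()
        residue = -1
        for col, ch in enumerate(seq):
            if ch == "-":
                continue  # gaps do not advance the residue index
            residue += 1
            if residue in wanted:
                cols.add(col)
        intron_columns[gene] = cols

    sites = sorted(set().union(*intron_columns.values()))
    patterns = {
        gene: "".join("1" if col in cols else "0" for col in sites)
        for gene, cols in intron_columns.items()
    }
    return sites, patterns


aligned = {"g1": "MK-LV", "g2": "MKTLV"}
introns = {"g1": {2}, "g2": {3, 4}}
print(phyletic_pattern(aligned, introns))
```

Each pattern string has one character per intron site, so the set of strings can be written out as a FASTA-style 0/1 alignment.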

We performed alignments between intron/exon struc- 
tures using the Exalign algorithm (13). The algorithm 
was run in global alignment mode with intron gain/loss
detection enabled, to avoid false assignments caused by
intron gain/loss events between orthologs. We created
the Exalign dataset for each plant species in our database, 
and compared the full set of plant gene structures. The 
length of partially coding exons was adjusted to include 
only the coding portions. Fully non-coding exons were 
excluded from the comparison. 

Recently, a new tool named GECA was developed, 
which displays gene exon-intron organization by high- 
lighting changes in gene structure among members of a 
gene family (14). In PIECE, orthologs can also be dis- 
played using the GECA method with default settings. 

UTILITY AND WEB INTERFACE 

PIECE is a web-based tool combining a MySQL database 
management system with a dynamic web interface based 
on PHP and Javascript. The exon-intron data in the 
database are searchable and viewable. 

Search system 

PIECE has a user-friendly entry point for searching each 
gene. Users can retrieve any gene by a keyword search for
gene ID, gene name or gene function or by a BLAST
search using either the nucleotide or protein sequence of
the gene of interest (Figure 1A). The main page of the
search results lists all genes meeting the search criteria and 
provides brief information, such as gene accession, gene 
description, source organism, gene annotation, Pfam 
domain and gene ortholog analysis (Figure 1B). The
columns in Gene ID, Pfam ID and Ortholog Gene 
Structures are linked to more detailed information of the 
analysis results. For example, clicking on a gene accession 
will display details on the exon-intron information for 
each gene, including the genomic and transcript sequences 
(Figure 1C). Each gene in the search result contains the 
location of the Pfam domain that was identified in its 
protein sequence. Clicking on a Pfam domain will show 
its phylogenetic tree along with gene structures 
(Figure 1D). The ortholog analysis link will display the
GLOOME, Exalign and GECA results of the gene
(Figure 1E and F), which link to details that include
more elaborate descriptions of the orthologous gene
structure evolution results. Details of the gene structure
and evolution analyses in the Pfam ID and Ortholog Gene
Structures columns are presented below.

Phylogenetic tree display along with gene structure 
and Pfam domains 

The database provides a user-friendly graphical view that 
displays SVG-formatted output, which contains a gene 
structure and Pfam domain pattern diagram linked to 
a bootstrapped similarity dendrogram (Figure 2). 
Depending on the annotations present in the database,
the viewer can automatically recognize elements of the
gene structure, such as coding exons, introns and UTRs.
Default conventions are used to render exons (thick
boxes), UTRs (thin blue and green boxes) and introns
(thin grey boxes), but the user can modify the display of
the elements by selecting a different color or choosing
not to display the element. A search function is provided to
allow users to search for a gene ID in the phylogenetic tree.
If the ID is found, it will be highlighted in red. Controls
available on the bottom of the page allow magnification of
the tree image (e.g. the 'zoom in' and 'zoom out' buttons) as
well as movement of the magnified image with the arrow
buttons. When viewing the gene structure, the exons, 
introns and Pfam domains for genes can be selected 
easily. When the user hovers the mouse over each 
element, the length of the element will be shown. By 
clicking the element, the sequence information for the 
selected element will be displayed. As a demonstration, 
the analysis results for the Lipoxygenase gene family are 
presented in Figure 2. 

Multiple types of gene structure display 

Gene structure visualization is important for analyzing 
exon-intron evolution. Typically, the basic components
of gene structure (UTRs, introns and exons) are displayed on
the genomic sequence (19,20). To find relationships between
exon-intron compositions in the encoded proteins, exon 
boundaries are also mapped onto the protein sequence 
(21,22). The view function in PIECE provides three 
types of exon-intron displays for each Pfam domain. 
Users can select any protein domain of interest by 
clicking the Pfam ID in the search results. 

Analysis of gene structure evolution with groups 
of orthologs 

On a gene family scale, global analysis is useful for dating 
intron changes; however, for certain genes, gene structure 
evolution in different species is not clear in phylogenetic 
trees with exon-intron pattern diagrams. Moreover, not
all genes contain Pfam motifs in their encoded proteins
and therefore cannot be analyzed as in Figure 2.
Consequently, it is necessary to show exon-intron fluctu- 
ations for each gene in the database because intron- 
containing genes are spread across diverse plant phyla, 
whereas orthologs often have similar exon-intron organ- 
ization even at large evolutionary distances (23). 
PIECE provides three analysis methods for each gene to 
infer the evolution of exon-intron structure in multiple 
protein-coding ortholog sets along a fixed-species 
phylogeny. 

GLOOME 

To analyze the gain and loss of introns in the ortholog
group, we used GLOOME, which accurately infers
branch-specific and site-specific gain and loss events with
presence and absence profiles. To integrate GLOOME
into PIECE, we first aligned the protein sequences
within the ortholog group. We next coded intron
characteristics using binary characters to denote presence ('1')
and absence ('0'). The 0/1 matrix, in which rows
correspond to species and columns correspond to
binary characters, is termed a phylogenetic profile of
presence-absence or phyletic pattern and is equivalent to
an MSA. In PIECE, the output of GLOOME includes
plant species trees, gene structure displays, intron site
sequence logos and the expected number of gains and
losses for each intron site (Figure 1E). When users
click each box in the histogram, the viewer will show the
intron site in the aligned protein sequence. The alignments
were generated using MUSCLE, and the sequence
alignment graphical display was implemented in the
Jalview (24) Java applet. GLOOME provides useful
analytical facilities for exploring the degree of
conservation of intron evolution across proteins in the ortholog
group and also for analyzing the distribution of
exonic sequences within the aligned coding sequences of
domains.

Figure 2. The PIECE viewer. Data for the LOX gene family (PF00305) in Arabidopsis thaliana, rice, poplar and Chlamydomonas reinhardtii. (A)
Dendrogram of sequences clustered according to the presence and similarity of extracted Pfam motifs. (B) Diagram that displays positional
information of the gene structure in each sequence. (C) Color selector and check boxes for displaying introns, CDS, UTRs, Pfam domains and intron
phases, and save button to save the output as a PNG file. (D) Color for the plant species and operation panel for manipulating the output.

Exalign 

During evolution, one exon may split into multiple exons 
or multiple exons may fuse into one; such events have 
stringent constraints in exon length, and this characteristic 
can be used to determine cases of exon fusion or division. 
To analyze the evolution of gene structure of orthologs, 
we use another tool named Exalign (13). The Exalign 
viewer of PIECE can show the relationship of exons in 
orthologous genes from different plant species. This 
viewer provides exon-intron display for orthologs of 
gene structure data sets linked to the species phylogeny 
(Figure 1F). The gene-exon comparison between species
is shown as colored lines. Different colors indicate differ- 
ent exon comparison results. In PIECE, any gene data 
with its set of orthologs can be put into the Exalign 
viewer at the user's request to trace the evolutionary
history of genes and, particularly, to detect exon
relationships and fusion events.
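
The length constraint mentioned above can be illustrated with a deliberately simple sketch. It is not the Exalign algorithm: it only flags exons in one gene whose length equals the combined length of two adjacent exons in an ortholog, and it assumes exact length conservation. The function name is hypothetical.

```python
def find_fusion_candidates(exons_a, exons_b):
    """Return (i, j) pairs of candidate exon fusion or split events.

    exons_a, exons_b: exon lengths of two orthologs in transcript order.
    A pair (i, j) means that exon i of gene A is as long as exons j and
    j + 1 of gene B combined.
    """
    events = []
    for i, length in enumerate(exons_a):
        for j in range(len(exons_b) - 1):
            if exons_b[j] + exons_b[j + 1] == length:
                events.append((i, j))
    return events


# Exon 1 of gene A (300 bases) matches exons 1 and 2 of gene B (200 + 100).
print(find_fusion_candidates([120, 300, 90], [120, 200, 100, 90]))
```

A realistic comparison would also have to tolerate small length differences and take the flanking exons into account, which this sketch does not attempt.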



GECA 

Aligning exon-intron structures together with
sequence similarities is helpful for annotating
gene structure information. GECA can analyze gene
exon-intron organization and highlight changes in gene 
structure (14). GECA relies on protein alignments, 
completed with the identification of common introns in 
corresponding genes using CIWOG (Common Introns 
Within Orthologous Genes) (10). In PIECE, each gene 
has a GECA link to view the orthologs that are aligned 
using their common introns detected by CIWOG. The 
similarities between orthologous sequences in the align- 
ment are represented at the level of amino acids in the 
translated exons. A blue line links two amino acids if 
they are identical, a purple line indicates conservative sub- 
stitutions, and intron type is detected by CIWOG 
(Figure 1G). 

GSDraw web server 

A number of web tools have been developed for gene 
structure annotation, such as GSDS (19), FancyGene 
(20) and GECA (14). The purpose of these programs is 
to represent the exon-intron structure of several genes in a 
single image to perform global gene structure comparisons 
(14). However, these resources do not display sequences 
with phylogenetic relationships and automatically 
detected protein motifs. Therefore, we developed 
GSDraw as part of PIECE. GSDraw is a convenient 
and easy-to-use interface for gene structure annotation 
that integrates Sim4 (25), MEME (26), MUSCLE (16) 
and FastTree (17) into a single web-based tool. The pro- 
cedures for designing and implementing the GSDraw 
server are illustrated in Figure 3. Users submit a query
sequence set (in multi-FASTA format) consisting of
genomic, CDS or transcript sequences to GSDraw
(http://wheat.pw.usda.gov/piece/GSDraw.php) and
obtain schematic diagrams of their gene structures with
annotated Pfam protein motifs and a phylogenetic tree.
This capability allows users to view a PIECE
database-style display for any selected gene family group
(of three or more genes) from any species with available
data. The GSDraw output for three rice LRR-Kinase
genes is shown in Supplementary Figure S1. The user
can modify the gene structure display to their own
preferences by selecting different colors for the annotated
sequences and/or choosing whether or not to display each
of the Pfam motifs, similar to what is allowed in the
PIECE viewer.

D1164 Nucleic Acids Research, 2013, Vol. 41, Database issue

Figure 3. Workflow chart of GSDraw. Inputs: genomic, transcript and CDS sequences. Steps shown: Sim4 (exon, intron, UTR and intron phase
information); translation (protein sequence); MEME (motifs in the sequence); MUSCLE (protein and motif sequence alignment); FastTree
(phylogenetic tree). Output: multiple gene structure display in PIECE.



DISCUSSION 

Simple sequence alignment and comparison is usually
unable to provide a clear picture of the structural
evolution of genes, e.g. how their intron-exon structures, intron
lengths, alternative splicing and untranslated regions 
change over time. Although there has been a rapid 
growth in the number of plant genome databases, such 
as PlantGDB (7), PLAZA (8), Phytozome (9) and 
GreenPhylDB (27), these resources lack comparative 
analytical capabilities for integrating protein domains 
from multiple species to investigate exon-intron structural 
evolution. ExDom (28) contains an extensive collection 
of exon-intron gene structures mapped to protein 
domains, but it primarily focuses on non-plant species.
Furthermore, most related databases do not display the
phylogenetic tree of gene families and orthologous gene 
evolution histories. To address these limitations, we de- 
veloped PIECE, which characterizes the number, position 
and length of introns and exons from 25 individual 
sequenced plant genomes. The PIECE database provides 
a panoramic perspective from which to investigate the 
evolution of gene structures on a broad evolutionary 
time scale. Furthermore, PIECE provides an easy entry 
point for researchers to immediately access gene structure 
evolution information without having to install any 
software. 

For example, heat shock response in eukaryotes is tran- 
scriptionally regulated by conserved heat shock transcrip- 
tion factors (Hsfs). Hsf genes are represented by a large 
multigene family in plants. To illustrate the possible mech- 
anisms of structural evolution of Hsf homologs, we used 
PIECE to compare the exon-intron structures of individ- 
ual Hsf genes in 10 plant lineages. Supplementary Figure 
S2 provides a detailed illustration of the relative length of 
introns and the conservation of the corresponding exon 
sequences within each of the Hsf genes. Notably, although 
the members of the Hsf gene family exhibited differences 
in intron number and intron length, the intron positions 
and intron phases were remarkably well conserved, with 
conserved splicing sites between adjacent exons. 

Figure 4. An evolutionary model for the structural evolution of the Hsf gene family in plants. (A) Dendrogram representing the evolutionary
relationship of all plant lineages. (B) Proposed exon-intron structure of the ancestral Hsf gene in each plant lineage. (C) Current exon-intron
structure of Hsf genes. The exon-intron structure of the Hsf genes in gymnosperms is represented with an empty bar because genomic sequences are
unavailable.

To further investigate the structural evolution of Hsf
genes in species from different lineages, we also used PIECE to
create images that contain gene structure information in
unaligned and aligned protein sequences (Supplementary
Figures S3 and S4). We next constructed an evolutionary
model that could predict the current Hsf genes in plant
species of different lineages (Figure 4). Under the
assumption that introns located at identical positions
with identical phases were present in the
common ancestor, we reconstructed the ancestral
exon-intron structure of Hsf for all plant lineages (Figure 4).
The results obtained from the Hsf intron analysis sug- 
gested that the ancestral Hsf contained >12 introns, sym- 
metrically distributed throughout its coding sequence 
(Figure 4). The aquatic plants (green algae) have a large 
number of introns. Most introns were lost in the evolution 
of aquatic plants (green algae) to lower land plants 
(mosses and lycophytes), including I2, I3, I5, I6, I8 and
I10-12. Moreover, single intron losses also occurred
during the expansion and divergence of the Hsf gene
family in each plant lineage. For example, the ancestral
Hsf in monocots contained at least 3 introns, whereas all
Hsf genes in monocots contained only 1 or 2 introns
(Figure 4B and C). It appears that I7 and I9 are not
present in monocots, but are present in the dicot
ancestor. Besides intron loss, gain of an intron is
also observed: I1 is only present in angiosperms.
Furthermore, the analysis revealed that I4 is present in
the Hsf gene of the common ancestor of all plant
lineages (Figure 4B), and its position is in the DNA
binding domain (DBD) (Figure 4C). This observation
indicates that the Hsf family in plants not only has a
conserved DBD motif but also contains a conserved
intron in the DBD.
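
The working assumption behind the ancestral reconstruction can be expressed as a small sketch. An intron observed at the same aligned position and phase in at least two lineages is treated as ancestral. The threshold of two lineages, the function name and the example coordinates are illustrative assumptions and are not taken from the Hsf analysis.

```python
from collections import Counter


def inferred_ancestral_introns(introns_by_lineage, min_lineages=2):
    """Return introns shared by at least `min_lineages` lineages.

    introns_by_lineage: lineage -> iterable of (aligned_position, phase).
    An intron is identified by its aligned position and its phase.
    """
    counts = Counter()
    for introns in introns_by_lineage.values():
        counts.update(set(introns))  # count each lineage once per intron
    return sorted(site for site, n in counts.items() if n >= min_lineages)


# Hypothetical intron sites given as (aligned position, phase).
observed = {
    "green algae": [(10, 0), (45, 1), (80, 2)],
    "moss": [(45, 1)],
    "monocot": [(45, 1), (80, 2), (120, 0)],
}
print(inferred_ancestral_introns(observed))
```

Introns seen in only one lineage are left out, because under this assumption they could equally reflect a lineage-specific gain.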

As the examples demonstrate, the capabilities of PIECE 
will provide researchers with many hypotheses for design- 
ing molecular biology studies and will help to elucidate the 
evolutionary history of plant genes. Future efforts will 
extend the number of available plant species and 
enhance the analytical capabilities of PIECE. Newly pub- 
lished plant genomes will enable efficient phylogenetic 
analyses of exon-domain relationships in plants and 
in-depth analysis of the evolutionary history of protein 
domains. Alternative splicing is an important biological
process that greatly increases the diversity of proteins
that can be encoded by the genome. One of the future
directions will focus on the integration of alternative 
splice data into PIECE for gene evolution and structure 
analyses. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Table 1 and Supplementary Figures 1-4. 